# Extraction Methods Documentation

## Overview

This document describes the technical methods used to extract quantitative claims and partnership references from Medicaid MCO procurement documents.

## Software Environment

- **Python**: 3.11
- **Key Libraries**: pandas 2.2.0, numpy 1.26.0, pypdf 5.9.0, python-docx 1.2.0

## Document Processing Pipeline

### Phase 1: Document Acquisition

Documents were obtained from:
1. State procurement portals (direct download)
2. State Medicaid agency websites
3. FOIA requests for non-public documents
4. State legislative archives

### Phase 2: Archive Extraction

Compressed archives (.zip) were extracted using Python's `zipfile` module with recursive handling of nested archives.
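
The recursive handling described above could be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `extract_archive`, the `max_depth` guard, and the choice to unpack each nested archive into a sibling directory named after it are all assumptions.

```python
import zipfile
from pathlib import Path

def extract_archive(archive_path, dest_dir, max_depth=5):
    """Extract a .zip archive, then extract any .zip files found inside it.

    Hypothetical helper: max_depth is an assumed safeguard against
    pathological nesting, not a documented pipeline setting.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    if max_depth <= 0:
        return
    # Snapshot the nested archives before recursing so files extracted
    # by deeper calls are not revisited by this loop.
    for nested in list(dest.rglob("*.zip")):
        if zipfile.is_zipfile(nested):
            # Unpack inner.zip into a sibling directory named "inner".
            extract_archive(nested, nested.with_suffix(""), max_depth - 1)
```

Note that `rglob("*.zip")` is case-sensitive on most Linux file systems, so archives with an upper-case `.ZIP` extension would need an additional pattern.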

### Phase 3: Text Extraction

**PDF Processing**:
```python
from pypdf import PdfReader

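def extract_pdf_text(filepath):
    """Return the text of every page in a PDF, with a newline after each page."""
    reader = PdfReader(filepath)
    # Fall back to an empty string for pages that yield no text
    # (for example, image-only pages).
    pages = [page.extract_text() or "" for page in reader.pages]
    return "".join(page_text + "\n" for page_text in pages)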
```

**DOCX Processing**:
```python
from docx import Document

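def extract_docx_text(filepath):
    """Return body paragraph text from a .docx file, one paragraph per line."""
    doc = Document(filepath)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included.
    return "\n".join(para.text for para in doc.paragraphs)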
```

### Phase 4: Claim Extraction

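Claims were identified using regular-expression patterns. Each entry pairs a pattern with the claim type assigned to its matches; the first capture group holds the numeric value or, for metric references, the measure identifier: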

```python
patterns = [
    # Improvement claims
    (r'(\d+(?:\.\d+)?)\s*(?:percent|%)\s*(?:improvement|increase|reduction|decrease)', 'improvement'),

    # Change claims
    (r'(?:improved?|increased?|reduced?|decreased?|achieved?)\s+(?:by\s+)?(\d+(?:\.\d+)?)\s*(?:percent|%)', 'change'),

    # Quality metric references
    (r'(?:HEDIS|CAHPS|NQF|CMS)\s+(?:measure\s+)?([A-Z0-9\-]+)', 'metric'),

    # Rate statements
    (r'(?:rate|score)\s+(?:of\s+)?(\d+(?:\.\d+)?)\s*(?:percent|%)', 'rate'),

    # Target commitments
    (r'(?:target|goal|achieve)\s+(\d+(?:\.\d+)?)\s*(?:percent|%)\s+(?:by|within)', 'target')
]
```
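
As an illustration of how such a pattern list might be applied, the sketch below runs two of the patterns over a sample sentence and records the claim type, captured value, and position of each match. The function name `extract_claims`, the record fields, and the sample sentence are hypothetical. The use of `re.IGNORECASE` is also an assumption, since the source does not state whether matching was case-insensitive.

```python
import re

# Two of the patterns listed above, copied verbatim for a self-contained example.
patterns = [
    (r'(\d+(?:\.\d+)?)\s*(?:percent|%)\s*(?:improvement|increase|reduction|decrease)', 'improvement'),
    (r'(?:rate|score)\s+(?:of\s+)?(\d+(?:\.\d+)?)\s*(?:percent|%)', 'rate'),
]

def extract_claims(text, patterns, flags=re.IGNORECASE):
    """Return one record per pattern match, ordered by position in the text."""
    claims = []
    for pattern, claim_type in patterns:
        for match in re.finditer(pattern, text, flags):
            claims.append({
                "type": claim_type,
                "value": match.group(1),        # first capture group
                "start": match.start(),         # character offset of the match
                "matched_text": match.group(0),
            })
    return sorted(claims, key=lambda claim: claim["start"])

# Hypothetical sample sentence, not taken from any procurement document.
sample = "The plan reported a 12.5% reduction in readmissions and a rate of 87 percent."
for claim in extract_claims(sample, patterns):
    print(claim["type"], claim["value"])
# improvement 12.5
# rate 87
```

Case-insensitive matching would also loosen the metric pattern, whose `[A-Z0-9\-]+` group would then accept lower-case words, so that pattern may be better applied with `flags=0`.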

### Phase 5: Partnership Extraction

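Partnership references were extracted using two regular expressions. The first captures the organization named after a partnership verb followed by "with"; the second captures capitalized names that end in an institutional keyword: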

```python
partnership_patterns = [
    r'(?:partner(?:ship|ed|ing)?|collaborat(?:e|ion|ing)|contract(?:ed)?)\s+with\s+([A-Z][A-Za-z\s&,]+?)(?:\.|,|\s+to|\s+for)',
    r'([A-Z][A-Za-z\s&]+(?:Hospital|Health|Medical|Foundation|University|Center|Services))'
]
```
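
A minimal sketch of applying these two patterns is shown below. The function name `extract_partners`, the whitespace normalization, the de-duplication step, and the sample sentence are assumptions for illustration. Matching is left case-sensitive because both patterns rely on an initial capital letter.

```python
import re

# The two patterns listed above, copied verbatim.
partnership_patterns = [
    r'(?:partner(?:ship|ed|ing)?|collaborat(?:e|ion|ing)|contract(?:ed)?)\s+with\s+([A-Z][A-Za-z\s&,]+?)(?:\.|,|\s+to|\s+for)',
    r'([A-Z][A-Za-z\s&]+(?:Hospital|Health|Medical|Foundation|University|Center|Services))',
]

def extract_partners(text, patterns=partnership_patterns):
    """Return unique captured names, in order of first appearance per pattern."""
    seen = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            # Collapse internal whitespace; the character class allows newlines.
            name = " ".join(match.group(1).split())
            if name not in seen:
                seen.append(name)
    return seen

# Hypothetical sample sentence, not taken from any procurement document.
sample = ("the plan partnered with Lakeside Clinic to expand access, "
          "with support from Mercy Health Foundation.")
print(extract_partners(sample))
# ['Lakeside Clinic', 'Mercy Health Foundation']
```

The second pattern is greedy and starts at any capital letter, so in a sentence such as "The plan contracted with Mercy Health Foundation to expand access." it captures "The plan contracted with Mercy Health Foundation". Its matches may therefore include leading words that are not part of the organization name.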

## Validation

### Inter-rater Reliability

- Random sample: n=200 claims
- Independent coding by two researchers
- Cohen's kappa: >0.85 for domain codes, >0.75 for evidence types
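
For reference, Cohen's kappa compares the observed agreement between two coders with the agreement expected by chance from each coder's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two parallel lists of labels. The function name and the example labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # Both raters used one identical label throughout.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical domain codes for four claims.
rater_a = ["access", "access", "quality", "quality"]
rater_b = ["access", "access", "quality", "access"]
print(cohens_kappa(rater_a, rater_b))
# 0.5  (p_o = 0.75, p_e = 0.5)
```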

### Quality Control

1. Manual review of 50 randomly selected documents
2. Verification of text extraction fidelity
3. Spot-checking of pattern match accuracy
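
A random review sample like the one in step 1 can be made repeatable by fixing the random seed. The sketch below is an assumption about how such a draw could be done. The function name, the seed value, and the document identifiers are hypothetical, and the source does not state how the 50 documents were actually selected.

```python
import random

def sample_documents(doc_ids, k=50, seed=0):
    """Draw k document IDs without replacement using a fixed seed."""
    rng = random.Random(seed)  # Local generator; global random state is untouched.
    # Sort first so the draw does not depend on the order of the input.
    return rng.sample(sorted(doc_ids), k)

# Hypothetical identifiers for 300 documents.
all_docs = [f"doc_{i:04d}" for i in range(300)]
review_set = sample_documents(all_docs)
print(len(review_set))
# 50
```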

## Reproducibility

All extraction code is available in the GitHub repository:
https://github.com/sanjaybasu/medicaid_rfp_analysis

To reproduce the extraction:

```bash
git clone https://github.com/sanjaybasu/medicaid_rfp_analysis
cd medicaid_rfp_analysis
pip install -r requirements.txt
python scripts/run_full_analysis.py --input /path/to/documents --output /path/to/output
```

## Known Limitations

1. Legacy .doc files (pre-2007 format) were excluded from text extraction.
2. Scanned PDFs without an OCR text layer were excluded.
3. Pattern-based extraction may miss claims phrased in non-standard formats.
4. Some state documents were redacted or only partially available.
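
The first two exclusions could be screened for with a check like the one below. This is a heuristic sketch rather than the pipeline's actual logic: the function name `exclusion_reason` and the `min_chars` threshold are assumptions, and a PDF is treated as lacking a text layer only because its extracted text is nearly empty.

```python
from pathlib import Path

def exclusion_reason(filepath, extracted_text=None, min_chars=20):
    """Return why a file would be excluded, or None if it can be processed.

    min_chars is an assumed threshold, not a documented pipeline setting.
    """
    suffix = Path(filepath).suffix.lower()
    if suffix == ".doc":
        return "legacy .doc format"
    if suffix == ".pdf" and extracted_text is not None:
        # Almost no extractable text suggests a scanned PDF without an OCR layer.
        if len(extracted_text.strip()) < min_chars:
            return "PDF without extractable text"
    return None

print(exclusion_reason("rfp_response.doc"))
# legacy .doc format
print(exclusion_reason("scan.pdf", extracted_text="  \n"))
# PDF without extractable text
```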
